[BEAM-1157] Add HBaseIO #1961

Closed
iemejia wants to merge 1 commit into apache:master from iemejia:BEAM-1157-HBaseIO

Conversation

@iemejia
Member

@iemejia iemejia commented Feb 9, 2017

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

@iemejia
Member Author

iemejia commented Feb 9, 2017

R: @dhalperi

@coveralls

Coverage Status

Coverage remained the same at 69.752% when pulling 233c099 on iemejia:BEAM-1157-HBaseIO into e21f9ae on apache:master.

@iemejia
Member Author

iemejia commented Feb 10, 2017

Rebased to fix the error produced by the change of Create.of() => Create.empty.

@coveralls

Coverage Status

Changes Unknown when pulling 2505d0b on iemejia:BEAM-1157-HBaseIO into apache:master.

@jbonofre
Member

R: @jbonofre

@jbonofre
Member

First of all, the tests have to be fixed. I'm going to take a look.

@iemejia
Member Author

iemejia commented Feb 10, 2017

There is something fishy going on: I ran mvn clean verify on my machine and everything passes. We need to check whether something is stopping the HBaseTestingUtility from being instantiated in Jenkins, because the tests that fail are the ones that use the embedded instance.

@dhalperi
Contributor

dhalperi commented Feb 10, 2017

R: -@jbonofre -- I have already begun a substantial review of this one. Save your time and enjoy your vacation.

@dhalperi
Contributor

@iemejia I suggest that the precommit bugs are at the least identifying places you can improve error catching. Try adding more error reporting so you can see where things are going wrong.

@iemejia
Member Author

iemejia commented Feb 10, 2017

Yes sir! The issue is that almost all the failures are NullPointerExceptions on objects that should have been initialized in @BeforeClass. I will log the initialization to check whether it is actually succeeding.

Contributor

@dhalperi dhalperi left a comment

pass over read.

<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-direct-java</artifactId>
<version>${project.version}</version>
Contributor

drop -- should be inherited

Member Author

ok

<properties>
<hbase.version>1.3.0</hbase.version>
<hbase.guava.version>12.0.1</hbase.guava.version>
<hbase.protobuf.version>2.5.0</hbase.protobuf.version>
Contributor

Shade all these dependencies for HBaseIO module itself. Don't leak protobuf or guava dependencies.

Member Author

Will do. I had doubts, but since this is the current policy I will fix it.

Member Author

ok

<artifactId>commons-lang</artifactId>
<version>2.6</version>
<scope>test</scope>
</dependency>
Contributor

shade any of these you do not need to expose on the public API surface

Member Author

This one is scoped to test, so it should not leak. I don't think we need to shade it. Or do you have a strong reason to do it?

Contributor

Ack, you're right, I missed the test

* .withTableId("table"))
* .withScan(scan));
*
* // Scan a prefix of the table.
Contributor

"prefix" not quite the right word I think, but good.

Member Author

@iemejia iemejia Feb 10, 2017

Contributor

+1

* <pre>{@code
* Configuration configuration = ...
* PCollection<KV<ByteString, Iterable<Mutation>>> data = ...;
* data.setCoder(HBaseIO.WRITE_CODER)
Contributor

;?

Member Author

Saving the best for last: this is the one where I really need some hints.
To keep the API similar to Bigtable's, I left Mutation in the signature, but the HBase Mutation is not Serializable. It therefore has to be encoded before it can be passed along, and I did not find a way to do that from inside the IO; externally, the user can register a coder in the CoderRegistry or call setCoder on the PCollection.
I don't know how to solve it otherwise, because I would need the opposite of getDefaultOutputCoder, something like a getDefaultInputCoder, and I did not find that in the API. Is there another way to do this that I am missing? Can you give me some ideas?

Contributor

Ha! I literally meant "there's a missing ;".

@kennknowles filed a JIRA recently about registering coders dynamically. As is, something like this may be required, or of course the user adds custom configuration to the pipeline coder registry.

Member Author

Hehe, ok. Can you please point me to the JIRA number so I can start watching it and improve this once it is solved?

if ((scanWithNoLowerBound
|| isLastRegion || Bytes.compareTo(startRow, endKey) < 0)
&& (scanWithNoUpperBound || Bytes.compareTo(stopRow, startKey) > 0)) {
regionLocations.add(regionLocation);
Contributor

looks right. Commenting for myself later that I need to look at how well this is tested.

Contributor

Is there a reason you didn't use the ByteKey and ByteKeyRange functions like overlap and contains to simplify this logic?

Member Author

At the beginning I didn't want to use ByteKey and ByteKeyRange because I thought they were related to Bigtable, not Beam. I propose changing this logic afterwards, in a subsequent PR, maybe together with the subsplits work; I don't really want to change the current logic now.

Contributor

No, ByteKey and ByteKeyRange are not Bigtable-specific -- they're specifically added to provide a standard set of common and tricky logic around keys that are byte[].
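
The region filter in the diff above can be read as an intersection test between two half-open key ranges. A minimal sketch of that idea follows. It is a hypothetical, self-contained stand-in written against plain byte[]: it is not Beam's ByteKeyRange and not the PR's code. It assumes that an empty array means "unbounded", which is how the scanWithNoLowerBound, scanWithNoUpperBound and isLastRegion flags in the snippet appear to be used, and it needs Java 9+ for Arrays.compareUnsigned.

```java
import java.util.Arrays;

/** Hypothetical helper: a half-open byte[] key range [start, end). */
final class KeyRangeSketch {
  private final byte[] start; // inclusive; empty array = no lower bound (assumption)
  private final byte[] end;   // exclusive; empty array = no upper bound (assumption)

  KeyRangeSketch(byte[] start, byte[] end) {
    this.start = start;
    this.end = end;
  }

  /** True if the two ranges share at least one key. */
  boolean overlaps(KeyRangeSketch other) {
    return startsBeforeEndOf(this, other) && startsBeforeEndOf(other, this);
  }

  // Checks a.start < b.end with unsigned lexicographic order,
  // treating an empty b.end as "plus infinity".
  private static boolean startsBeforeEndOf(KeyRangeSketch a, KeyRangeSketch b) {
    return b.end.length == 0 || Arrays.compareUnsigned(a.start, b.end) < 0;
  }
}
```

With a scan of [5, 10), a region [0, 6) overlaps and a region starting at 10 does not, so a single symmetric check replaces the hand-written combination of flags and comparisons.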

|| Bytes.compareTo(startKey, startRow) >= 0) ? startKey : startRow;
final byte[] splitStop =
(scanWithNoUpperBound || Bytes.compareTo(endKey, stopRow) <= 0)
&& !isLastRegion ? endKey : stopRow;
Contributor

Is there a reason you didn't use the ByteKey and ByteKeyRange functions like overlap and contains to simplify this logic?

Member Author

See above.

LOG.debug("Closing reader after reading {} records.", recordsReturned);
if (scanner != null) {
scanner.close();
scanner = null;
Contributor

will this also close the connection allocated in start?

Member Author

No, excellent catch, fixing right now.

Member Author

ok
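
One possible shape for the fix discussed in this thread, as a sketch only: the class below is hypothetical and uses plain java.io.Closeable stand-ins instead of the HBase scanner and connection types, so it says nothing about the actual change that was merged. It shows the connection being released in a finally block so that it is closed even when closing the scanner fails.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical reader skeleton; the field names echo the diff, the class is illustrative. */
final class ReaderCloseSketch implements Closeable {
  private Closeable scanner;
  private Closeable connection;

  ReaderCloseSketch(Closeable scanner, Closeable connection) {
    this.scanner = scanner;
    this.connection = connection;
  }

  @Override
  public void close() throws IOException {
    try {
      if (scanner != null) {
        scanner.close();
        scanner = null;
      }
    } finally {
      // Runs even when scanner.close() throws, so the connection is not leaked.
      if (connection != null) {
        connection.close();
        connection = null;
      }
    }
  }
}
```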

@AfterClass
public static void afterClass() throws Exception {
admin.close();
htu.shutdownMiniCluster();
Contributor

these should always look like if (admin != null) { admin.close(); admin = null; } that will make your tests more resilient.

Member Author

ok
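
The null-guarded teardown suggested above could look roughly like the following. This is a sketch under assumptions: TeardownSketch is a hypothetical fixture, and AutoCloseable fields stand in for the HBase admin and mini cluster so the example compiles without HBase on the classpath. The point is that teardown stays safe when setup failed before the fields were assigned, and when it is invoked more than once.

```java
/** Hypothetical test fixture; names are illustrative and not taken from the PR. */
final class TeardownSketch {
  static AutoCloseable admin;   // stand-in for the HBase admin handle
  static AutoCloseable cluster; // stand-in for the embedded mini cluster

  // In a JUnit 4 test this method would carry the @AfterClass annotation.
  static void afterClass() throws Exception {
    if (admin != null) {
      admin.close();
      admin = null;
    }
    if (cluster != null) {
      cluster.close();
      cluster = null;
    }
  }
}
```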

CoderRegistry cr = TestPipeline.create().getCoderRegistry();
cr.registerCoder(Mutation.class, HBaseMutationCoder.class);
} catch (IOException e) {
e.printStackTrace();
Contributor

Why this instead of throw?
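
To illustrate the alternative the question points at, here is a minimal sketch in which a setup failure is rethrown rather than printed. SetupSketch and IoStep are hypothetical names introduced only for this example; when the enclosing method already declares throws Exception, simply letting the IOException propagate would work as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical setup wrapper; not part of the PR. */
final class SetupSketch {
  /** A setup action that may fail with an IOException. */
  interface IoStep {
    void run() throws IOException;
  }

  static void setUp(IoStep step) {
    try {
      step.run();
    } catch (IOException e) {
      // Fail fast: the test run stops here instead of continuing with a broken fixture.
      throw new UncheckedIOException("test setup failed", e);
    }
  }
}
```

Rethrowing surfaces the root cause at the point of failure, whereas printStackTrace lets the tests continue and fail later with less informative errors.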

@coveralls

Coverage Status

Coverage increased (+0.04%) to 69.765% when pulling 8c1d79f on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@coveralls

Coverage Status

Coverage decreased (-0.01%) to 69.713% when pulling 265d9af on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@dhalperi
Contributor

dhalperi commented Feb 13, 2017 via email

@iemejia
Member Author

iemejia commented Feb 13, 2017

Arrghh, I was just one test away from passing :)
I am going to try what you suggested. This issue looks similar, so the variable will probably do it.
https://issues.apache.org/jira/browse/HBASE-11711

Member

@jbonofre jbonofre left a comment

It looks better. Maybe having the binding host as a static variable would be better.

@coveralls

Coverage Status

Coverage decreased (-0.04%) to 69.686% when pulling a6584f4 on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 69.903% when pulling dcfeac0 on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 69.9% when pulling 3db8c1f on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@coveralls

Coverage Status

Coverage increased (+0.2%) to 69.91% when pulling 6914b2c on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

public static void beforeClass() throws Exception {
LOG.info("Starting HBase Embedded Server (HBaseTestUtility)");
System.out.println("Starting HBase Embedded Server (HBaseTestUtility)");
conf.setInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER, 1);
Contributor

I think that ultimately what you want is to, in your configuration, set the HBase server name to localhost or 127.0.0.1

@jbonofre
Member

Now that I'm back from vacation, I will help for the review with Dan.

@asfbot

asfbot commented Feb 17, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/7534/

Build result: FAILURE

[...truncated 6325 lines...]
    at hudson.remoting.UserRequest.perform(UserRequest.java:153)
    at hudson.remoting.UserRequest.perform(UserRequest.java:50)
    at hudson.remoting.Request$2.run(Request.java:332)
    at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.maven.plugin.MojoFailureException: You have 3 Checkstyle violations.
    at org.apache.maven.plugin.checkstyle.CheckstyleViolationCheckMojo.execute(CheckstyleViolationCheckMojo.java:588)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
    ... 31 more
2017-02-17T09:25:55.980 [ERROR]
2017-02-17T09:25:55.980 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
2017-02-17T09:25:55.980 [ERROR]
2017-02-17T09:25:55.980 [ERROR] For more information about the errors and possible solutions, please read the following articles:
2017-02-17T09:25:55.980 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
2017-02-17T09:25:55.981 [ERROR]
2017-02-17T09:25:55.981 [ERROR] After correcting the problems, you can resume the build with the command
2017-02-17T09:25:55.981 [ERROR] mvn -rf :beam-sdks-java-io-hbase
channel stopped
Setting status of 3e9c921 to FAILURE with url https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/7534/ and message: 'Build finished. '
Using context: Jenkins: Maven clean install
--none--

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 69.214% when pulling a7d97a1 on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 69.221% when pulling cbf2933 on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@dhalperi
Contributor

I took a pass on the HBase Configuration in #2036. The changes I made to the test seem to have worked -- the most recent Jenkins run number 7544 actually tested and passed HBaseIOTest. [Jenkins build failed overall because of the awful hacks I did to the poms to streamline testing.]

Please take a look -- I hope this unblocks you getting this green!

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 69.231% when pulling 8eb46f1 on iemejia:BEAM-1157-HBaseIO into caaa64d on apache:master.

@asfbot

asfbot commented Feb 17, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/7556/
--none--

@iemejia
Member Author

iemejia commented Feb 17, 2017

Finally it passed!
Thanks a lot, Dan, for the help fixing it and for the review. Now it is up to you to tell me whether I need to fix something else or we are ready to go.

Contributor

@dhalperi dhalperi left a comment

generally lgtm!

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
</plugin>
Contributor

I think most of these should be inherited. Are they needed?

Member Author

Yes, they should be put explicitly, all the other IOs do the same.

Contributor

I do not think they are needed. Please look at https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/pom.xml

The only place we have them is in pluginManagement, to override config from parent.

Member Author

@iemejia iemejia Feb 21, 2017

Member Author

I have prepared a subsequent PR to remove the need to import these three Maven plugins from all the IOs, but I am waiting for this one to get merged before opening it.

Contributor

Hmm, clearly I'm crazy. Thank you.

for (Result result : scanner) {
results.add(result);
}
System.out.println(estimatedBytesSize);
Contributor

drop println throughout.

htu = new HBaseTestingUtility(conf);
htu.startMiniCluster(1, 4);
admin = htu.getHBaseAdmin();
System.out.println("Started HBase Embedded Server (HBaseTestUtility)");
Contributor

drop println.

Contributor

(LOG okay)

Member Author

Oh, that was a leftover from trying to make Jenkins happy. Thanks for noticing.

Member Author

ok

/**
* This is just a wrapper class to serialize HBase {@link Scan}.
*/
public class SerializableScan implements Serializable {
Contributor

Are these intended for use outside of the hbase module, or only inside of the HBaseIO module? If the latter, I would not make them public.

Member Author

They are not. However, as they live in the coders package, I have to make them public, otherwise I cannot import them. One approach would be to move them to the same directory, but I think this makes the code messier (even if I agree with you about the desired access level).

@iemejia
Member Author

iemejia commented Feb 18, 2017

Ok, I fixed the last remarks and rebased into a single commit. I hope it is ok now so it can be merged soon.
Thanks again for the review and the patience, Dan.

@coveralls

Coverage Status

Coverage increased (+0.003%) to 69.197% when pulling 1aee6cc on iemejia:BEAM-1157-HBaseIO into 01de255 on apache:master.

@asfbot

asfbot commented Feb 18, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/7582/
--none--

@iemejia
Member Author

iemejia commented Feb 20, 2017

Notice that this extra rebase was to add the IO to the parent pom and the javadoc pom.

@coveralls

Coverage Status

Coverage remained the same at 69.163% when pulling a1ad2a4 on iemejia:BEAM-1157-HBaseIO into 4e5a762 on apache:master.

@asfbot

asfbot commented Feb 20, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Java_MavenInstall/7598/
--none--

@jbonofre
Member

I already updated the other IOs. So, please, don't update the other IOs in another PR.

@asfgit asfgit closed this in ed0d457 Feb 22, 2017
@dhalperi
Contributor

Merged! Can you file JIRAs for follow-on work (adding support for Dynamic work rebalancing, improving splitting, switching to ByteKey, ByteKeyRange, ByteKeyRangeTracker to help the first 2)?

@iemejia
Member Author

iemejia commented Feb 22, 2017

Awesome, thanks a lot Dan for your review + the binding fix, I will be filing the JIRAs now.
Thanks a lot also JB for your feedback.

@iemejia iemejia deleted the BEAM-1157-HBaseIO branch February 22, 2017 08:19
@iemejia
Member Author

iemejia commented Feb 22, 2017

JB, as discussed, I will do the update to remove the extra plugins for the other IOs. Thanks for leaving that one to me.

@jbonofre
Member

Sure ! Thanks ! I set my branch "on hold" ;)
